- googleAuthR
- searchConsoleR
- googleAnalyticsR
- googleComputeEngineR (cloudyr)
- bigQueryR (cloudyr)
- googleCloudStorageR (cloudyr)
- googleLanguageR (rOpenSci)

Slack group to talk about the packages: #googleAuthRverse
https://www.rocker-project.org/
Maintain useful R images
- rocker/r-ver
- rocker/rstudio
- rocker/tidyverse
- rocker/shiny
- rocker/ml-gpu

FROM rocker/tidyverse:3.6.0
MAINTAINER Mark Edmondson (r@sunholo.com)
# install system dependencies needed by the R packages
RUN apt-get update && apt-get install -y \
  libssl-dev
## Install packages from CRAN
RUN install2.r --error \
  -r 'http://cran.rstudio.com' \
  googleAuthR \
  googleComputeEngineR \
  googleAnalyticsR \
  searchConsoleR \
  googleCloudStorageR \
  bigQueryR \
  ## install GitHub packages
  && installGithub.r MarkEdmondson1234/youtubeAnalyticsR \
  ## clean up
  && rm -rf /tmp/downloaded_packages/ /tmp/*.rds
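Once an image like this is built and pushed to a registry, a VM running it can be launched directly from R. A minimal sketch using googleComputeEngineR — the VM name, gcr.io path and credentials are placeholders, not from the talk:

```r
library(googleComputeEngineR)

# launch an RStudio-templated VM running the custom image
# (VM name, image path and credentials are placeholders)
vm <- gce_vm(
  name            = "my-r-vm",
  template        = "rstudio",
  dynamic_image   = "gcr.io/your-project/your-image",
  predefined_type = "n1-standard-2",
  username = "rstudio",
  password = "mypassword"
)
```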
- Flexible: no need to ask IT to install R everywhere, just run docker run; cross-cloud, ascendant tech
- Version controlled: no worries that new package releases will break your code
- Scalable: run multiple Docker containers at once; fits an event-driven, stateless, serverless future
- Continuous development with GitHub pushes
Good for one-off workloads
Pros
- Probably run the same code with no changes needed
- Easy to set up
Cons
- Expensive: 3.75TB of RAM costs $423 a day
- May be better to have the data in a database (googleCloudStorageR or bigQueryR)

Good for parallelisable data tasks
Pros
- Fault redundancy
- Forces repeatable/reproducible infrastructure
- library(future) makes parallel processing very usable
Cons
- Changes to your code for split-map-reduce
- Write meta code to handle the I/O of data and code
- Not applicable to some problems
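The split-map-reduce code changes mentioned above can be sketched as follows — big_df, the grouping column and summarise_chunk are hypothetical placeholders, not from the talk:

```r
library(future.apply)

# split: break the data into one chunk per worker
# (big_df and its group column are hypothetical)
chunks <- split(big_df, big_df$group)

# map: run a per-chunk function on each worker
# (summarise_chunk is a hypothetical placeholder)
partials <- future_lapply(chunks, summarise_chunk)

# reduce: combine the partial results back together
result <- do.call(rbind, partials)
```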
New in googleComputeEngineR v0.3 - a shortcut that launches the cluster and checks authentication for you
library(googleComputeEngineR)
vms <- gce_vm_cluster()
#2019-03-29 23:24:54> # Creating cluster with these arguments: template = r-base, dynamic_image = rocker/r-parallel, wait = FALSE, predefined_type = n1-standard-1
#2019-03-29 23:25:10> Operation running...
...
#2019-03-29 23:25:25> r-cluster-1 VM running
#2019-03-29 23:25:27> r-cluster-2 VM running
#2019-03-29 23:25:29> r-cluster-3 VM running
...
#2019-03-29 23:25:53> # Testing cluster:
r-cluster-1 ssh working
r-cluster-2 ssh working
r-cluster-3 ssh working

googleComputeEngineR has a custom method for future::as.cluster
library(googleComputeEngineR)
library(future)
library(future.apply)

# create cluster
vms <- gce_vm_cluster("r-vm", cluster_size = 3)
plan(cluster, workers = as.cluster(vms))

# get data
my_files <- list.files("myfolder", full.names = TRUE)
my_data <- lapply(my_files, read.csv)

# forecast data in cluster
library(forecast)
cluster_f <- function(my_data, args = 4){
  forecast(auto.arima(ts(my_data, frequency = args)))
}

result <- future_lapply(my_data, cluster_f, args = 4)

Futures can be multi-layered (use each CPU within each VM)
Thanks to Grant McDermott for figuring out the optimal method (Issue #129)
3 VMs, 8 CPUs each = 24 threads
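The multi-layer pattern can be sketched with future's nested topologies: the outer layer distributes across VMs, the inner layer across each VM's cores. chunks and heavy_fn below are hypothetical placeholders:

```r
library(googleComputeEngineR)
library(future)
library(future.apply)

vms <- gce_vm_cluster("r-vm", cluster_size = 3)

# outer layer: one worker per VM; inner layer: one worker per CPU core
plan(list(
  tweak(cluster, workers = as.cluster(vms)),
  multisession
))

# the outer future_lapply sends one chunk to each VM; the inner call
# fans that chunk's items out across the VM's cores
# (chunks and heavy_fn are hypothetical placeholders)
result <- future_lapply(chunks, function(chunk) {
  future_lapply(chunk, heavy_fn)
})
```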
Clusters of VMs + Docker = Horizontal scaling
Clusters of VMs + Docker + Task controller = Kubernetes
Good for Shiny / R APIs
Pros
- Auto-scaling, task queues etc.
- Scale to billions
- Potentially cheaper
- May already have a cluster in your organisation
Cons
- Needs stateless, idempotent workflows
- May need a message broker
- Minimum 3 VMs
Built on Cloud Build upon GitHub push:
FROM rocker/shiny
MAINTAINER Mark Edmondson (r@sunholo.com)
# install system dependencies needed by the R packages
RUN apt-get update && apt-get install -y \
  libssl-dev
## Install packages from CRAN needed for your app
RUN install2.r --error \
  -r 'http://cran.rstudio.com' \
  googleAuthR \
  googleAnalyticsR
## assume shiny app is in build folder /shiny
COPY ./shiny/ /srv/shiny-server/myapp/
Shiny App:
kubectl run shiny1 \
  --image gcr.io/gcer-public/shiny-googleauthrdemo:latest \
  --port 3838

kubectl expose deployment shiny1 \
  --target-port=3838 --type=NodePort
Built on Cloud Build on every GitHub push:
FROM trestletech/plumber
# copy your plumbed R script
COPY api.R /api.R
# default is to run the plumbed script
CMD ["api.R"]
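The api.R copied above could look like this minimal sketch, using plumber's standard annotation style with a simple echo endpoint:

```r
# api.R - a minimal plumber API (sketch)

#* Echo back a message
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: ", msg))
}
```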
R plumber API:
kubectl run my-plumber \
  --image gcr.io/your-project/my-plumber \
  --port 8000

kubectl expose deployment my-plumber \
  --target-port=8000 --type=NodePort
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: r-ingress-nginx
spec:
  rules:
  - http:
      paths:
      - path: /gar/
        # app deployed to /gar/shiny/
        backend:
          serviceName: shiny1
          servicePort: 3838
curl 'http://mydomain.com/api/echo?msg="its alive!"'
#> "The message is: its alive!"
A 40-minute talk at Google Next19 with lots of new things to try!
https://www.youtube.com/watch?v=XpNVixSN-Mg&feature=youtu.be
Great video that goes more into Spark clusters, Jupyter notebooks, training using ML Engine and scaling using Seldon on Kubernetes that I haven’t tried yet